Context for this discussion (also see #19):

alex [5:54 PM] 
Hey Arman, I talked with Ryan for a while today about allele specific transcript selection. He’s going to try to find variant reads with an FM index query. It’s unclear how to reconcile the variant reads once they’re all gathered.

Anyway, it might be nice to benchmark against a dumb baseline. If you have time, what do you think of putting together something that just takes aligned RNA, finds all the reads containing a variant (probably using varlens), and then takes the most common nucleotide at each flanking position?

arman [5:55 PM] 
sure, sounds easy enough

[5:55] 
so the input files will be a VCF file and a BAM file, and we will output some summary stats for each variant, if I understand you correctly

[5:56] 
but what exactly do you mean by a flanking position? The bases next to the variant?

alex [5:57 PM] 
Yeah. If you center the variant nucleotides, and then count up the {A,C,T,G} content at every position -1,-2,-3,&c to the left and +1,+2,+3, &c to the right (and normalize by number of reads containing those positions), can we just take the most common?

[5:58] 
The output could be something like:

    -5   -4   -3   -2   -1    _    _   +1   +2   +3   +4   +5
A  0.1  0.2
C  0.9
T
G

[5:59] 
Oh jeez, that’s getting tedious to fill out

arman [5:59 PM] 
yeah, sure -- something like a PSM representing a motif

alex [5:59 PM] 
But you’d have independent probability distributions over {A,C,T,G} at each position before and after the variant nucleotides

arman [6:00 PM] 
happy to do that. I will try to get something going during the weekend, but worst case scenario, we will have it by the end of Monday.

alex [6:00 PM] 
Extending out to something like 75bp in both directions (since a 25mer vaccine peptide = 25 aa × 3 nt = 75bp)

[6:00] 
The variant nucleotides should have no uncertainty in them since we filter the reads to definitely contain the variant

[6:01] 
And the best estimate of the sequence would be just taking the most probable nucleotide at each position

[6:01] 
It’s a bit naive since it treats positions independently, but if the entropy is low enough (i.e. not much splicing diversity) then could be OK

[6:02] 
And relying on the alignments is, on the one hand, making it harder to detect weird splices, but on the other hand it could be safer than reference-free fishing for matching sequences.

arman [6:03 PM] 
I agree. I think this would be a great start, and there is always the option to extend what we have from this and add some experimental stuff on top of it.

[6:03] 
Cool, sounds like a great plan.

alex [6:05 PM] 
Awesome

In [22]:
import pysam
import numpy as np
import pandas as pd

def contexify(samfile, chromosome, location, allele, radius):
    # This will be our score board: one row per reference position in
    # the window, with 5 columns, one per base
    counts = np.zeros(shape=((radius * 2) + 1, 5))
    d = pd.DataFrame(counts,
                     index=range(location - radius, location + radius + 1),
                     columns=['A', 'C', 'G', 'T', 'N'])

    # Let pysam pile up the reads covering our location of interest for us
    for column in samfile.pileup(chromosome, location, location + 1):
        # By default, the pileup covers a bigger region than we want,
        # so skip every column except the one at our location
        if column.pos != location:
            continue

        # Iterate over reads covering our location
        for read in column.pileups:
            # Skip reads with a deletion or splice gap at this position
            if read.is_del or read.is_refskip:
                continue

            pos = read.query_position  # offset of `location` within the read
            seq = read.alignment.query_sequence  # the read sequence

            # Filter out reads that do not support the variant
            if seq[pos] != allele:
                continue

            # Map each read offset onto a reference coordinate, compatible
            # with our score board indexing. Note this assumes the read
            # aligns without indels, which would shift the mapping.
            for offset, base in enumerate(seq):
                cursor = location - pos + offset

                # If the position is beyond our window, discard the data
                if (cursor < location - radius) or (cursor > location + radius):
                    continue

                # Count++ within our score board
                d.loc[cursor, base] += 1

    return d.transpose()  # Transpose it to make it look more natural

Inspired by the varlens examples, here is how this simple function works:


In [26]:
import urllib.request

urltemplate = "https://raw.githubusercontent.com/hammerlab/varlens/master/test/data/CELSR1/bams/{}"
urllib.request.urlretrieve(urltemplate.format("bam_5.bam"), "bam_5.bam")
urllib.request.urlretrieve(urltemplate.format("bam_5.bam.bai"), "bam_5.bam.bai")

samfile = pysam.AlignmentFile("bam_5.bam", "rb")

# C -> T variant at this particular locus
chromosome = "chr22"
location = 46930258
radius = 5

Let's compare the contexts for the variant and the reference alleles:


In [27]:
allele1 = "T"
contexify(samfile, chromosome, location, allele1, radius)


Out[27]:
   46930253  46930254  46930255  46930256  46930257  46930258  46930259  46930260  46930261  46930262  46930263
A         1         0         0         0         0         0         0         0         0         0         0
C         0         0         1         0         1         0         0         0         1         0         0
G         0         0         0         1         0         0         1         0         0         1         0
T         0         1         0         0         0         1         0         1         0         0         1
N         0         0         0         0         0         0         0         0         0         0         0

In [28]:
allele2 = "C"
contexify(samfile, chromosome, location, allele2, radius)


Out[28]:
   46930253  46930254  46930255  46930256  46930257  46930258  46930259  46930260  46930261  46930262  46930263
A      1018         0         0         0         0         0         0         0         0         0         1
C         0         0      1038         0      1079      1091         0         0      1063         0         0
G         1         0         0      1061         0         0      1081         1         0      1035         0
T         2      1025         0         0         0         0         1      1071         0         0      1024
N         1         2         0         0         0         0         0         0         0         0         0
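
The chat above sketches what comes next: normalize each position by the number of reads covering it, take the most probable base at each position as the best-estimate sequence, and watch the per-position entropy as a proxy for splicing diversity. Below is a minimal sketch of those steps on top of contexify's output; this cell is an addition of mine, not part of varlens, and it assumes every position in the window has coverage.


In [ ]:
counts = contexify(samfile, chromosome, location, allele2, radius)

# Normalize each position (column) by the number of reads covering it,
# yielding an independent distribution over {A, C, G, T, N} per position
frequencies = counts / counts.sum(axis=0)

# Best estimate of the sequence: the most probable base at each position
consensus = "".join(frequencies.idxmax(axis=0))

# Per-position Shannon entropy (bits); low entropy means the reads agree
# at that position, i.e. not much splicing diversity at this locus
nonzero = frequencies.replace(0, np.nan)
entropy = -(nonzero * np.log2(nonzero)).sum(axis=0)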
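
And since the plan is to feed in a VCF plus a BAM, here is a hypothetical driver loop built on pysam's VariantFile; the file name "variants.vcf.gz" is a placeholder, and the loop sticks to SNVs because contexify expects a single-nucleotide allele.


In [ ]:
vcf = pysam.VariantFile("variants.vcf.gz")  # placeholder path
for record in vcf:
    for alt in record.alts or ():
        # SNVs only: contexify expects a single-nucleotide allele
        if len(record.ref) != 1 or len(alt) != 1:
            continue
        # VCF positions are 1-based, while pileup coordinates are 0-based
        context = contexify(samfile, record.chrom, record.pos - 1, alt, radius)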